ARROW-11733: [Rust][DataFusion] Implement hash partitioning#9548
ARROW-11733: [Rust][DataFusion] Implement hash partitioning#9548Dandandan wants to merge 11 commits into
Conversation
Codecov Report
@@ Coverage Diff @@
## master #9548 +/- ##
=======================================
Coverage 82.28% 82.29%
=======================================
Files 244 244
Lines 55616 55659 +43
=======================================
+ Hits 45766 45804 +38
- Misses 9850 9855 +5
Continue to review full report at Codecov.
|
andygrove
left a comment
There was a problem hiding this comment.
This looks great @Dandandan. I agree that it makes sense to get the functionality working first and optimize later.
|
The integration failure looks like https://issues.apache.org/jira/browse/ARROW-11717 |
alamb
left a comment
There was a problem hiding this comment.
Looks good -- nice work @Dandandan
|
|
||
| let total_rows: usize = output_partitions.iter().map(|x| x.len()).sum(); | ||
|
|
||
| assert_eq!(8, output_partitions.len()); |
There was a problem hiding this comment.
would it make sense here also to assert on the distribution of rows (e.g. ensure that each batch has ~ 50*3 rows?
There was a problem hiding this comment.
Makes sense, but not sure how to do that currently, as it depends on random state (it could happen that all of them end up on same hash / partition in a very rare case).
|
@alamb from my side the PR is good to go |
|
Thanks @Dandandan -- looks great. |
This PR implements a first version of hash repartition.
This can used to (further) parallelize hash joins / aggregates or to implement distributed algorithms like
ShuffleHashJoin(https://github.com/ballista-compute/ballista/issues/595 )I didn't yet optimize for speed, as I think it makes sense to implement it and look for improvements later.
FYI @andygrove